Abstract

This paper presents a statistical parser for 
natural language that obtains a parsing 
accuracy — roughly 87% precision and 86% 
recall — which surpasses the best previously 
published results on the Wall St. Journal 
domain. The parser itself requires very lit- 
tle human intervention, since the informa- 
tion it uses to make parsing decisions is 
specified in a concise and simple manner, 
and is combined in a fully automatic way 
under the maximum entropy framework. 
The observed running time of the parser on 
a test sentence is linear with respect to the 
sentence length. Furthermore, the parser 
returns several scored parses for a sentence, 
and this paper shows that a scheme to pick 
the best parse from the 20 highest scoring 
parses could yield a dramatically higher ac- 
curacy of 93% precision and recall. 

1 Introduction 

This paper presents a statistical parser for natural 
language that finds one or more scored syntactic 
parse trees for a given input sentence. The parsing 
accuracy — roughly 87% precision and 86% recall — 
surpasses the best previously published results on 
the Wall St. Journal domain. The parser consists of 
the following three conceptually distinct parts: 

1. A set of procedures that use certain actions to
incrementally construct parse trees.

2. A set of maximum entropy models that com- 
pute probabilities of the above actions, and ef- 
fectively "score" parse trees. 

3. A search heuristic which attempts to find the 
highest scoring parse tree for a given input sen- 
tence. 

The maximum entropy models used here are similar
in form to those in (Ratnaparkhi, 1996; Berger,
Della Pietra, and Della Pietra, 1996; Lau, Rosenfeld,
and Roukos, 1993). The models compute the
probabilities of actions based on certain syntactic 
characteristics, or features, of the current context. 
The features used here are defined in a concise and 
simple manner, and their relative importance is de- 
termined automatically by applying a training pro- 
cedure on a corpus of syntactically annotated sen- 
tences, such as the Penn Treebank (Marcus, San- 
torini, and Marcinkiewicz, 1994). Although creat- 
ing the annotated corpus requires much linguistic 
expertise, creating the feature set for the parser it- 
self requires very little linguistic effort. 

Also, the search heuristic is very simple, and its 
observed running time on a test sentence is linear 
with respect to the sentence length. Furthermore, 
the search heuristic returns several scored parses for 
a sentence, and this paper shows that a scheme to 
pick the best parse from the 20 highest scoring parses 
could yield a dramatically higher accuracy of 93% 
precision and recall. 

Sections 2, 3, and 4 describe the tree-building
procedures, the maximum entropy models, and the
search heuristic, respectively. Section 5 describes
experiments with the Penn Treebank and section
6 compares this paper with previously published
works.

2 Procedures for Building Trees 

The parser uses four procedures, TAG, CHUNK,
BUILD, and CHECK, that incrementally build parse
trees with their actions. The procedures are applied
in three left-to-right passes over the input sentence;
the first pass applies TAG, the second pass applies
CHUNK, and the third pass applies BUILD and
CHECK. The passes, the procedures they apply, and
the actions of the procedures are summarized in
table 1 and described below.

The actions of the procedures are designed so that
any possible complete parse tree $T$ for the input
sentence corresponds to exactly one sequence of actions;
call this sequence the derivation of $T$. Each procedure,
when given a derivation $d = \{a_1 \ldots a_n\}$, predicts
some action $a_{n+1}$ to create a new derivation
$d' = \{a_1 \ldots a_{n+1}\}$. Typically, the procedures
postulate many different values for $a_{n+1}$, which cause
the parser to explore many different derivations when
parsing an input sentence. But for demonstration
purposes, figures 1-7 trace one possible derivation
for the sentence "I saw the man with the telescope",
using the part-of-speech (POS) tag set and constituent
label set of the Penn Treebank.

Pass        | Procedure | Actions                                                     | Description
------------|-----------|-------------------------------------------------------------|---------------------------------------------------------------------------
First Pass  | TAG       | A POS tag in tag set                                        | Assign POS tag to word
Second Pass | CHUNK     | Start X, Join X, Other                                      | Assign chunk tag to POS tag and word
Third Pass  | BUILD     | Start X, Join X, where X is a constituent label in label set | Assign current tree to start a new constituent, or to join the previous one
Third Pass  | CHECK     | Yes, No                                                     | Decide if current constituent is complete

Table 1: Tree-Building Procedures of Parser

2.1 First Pass 

The first pass takes an input sentence, shown in
figure 1, and uses TAG to assign each word a POS tag.
The result of applying TAG to each word is shown in
figure 2.

2.2 Second Pass 

The second pass takes the output of the first pass
and uses CHUNK to determine the "flat" phrase
chunks of the sentence, where a phrase is "flat" if
and only if it is a constituent whose children consist
solely of POS tags. Starting from the left, CHUNK
assigns each (word, POS tag) pair a "chunk" tag,
either Start X, Join X, or Other. Figure 3 shows the
result after the second pass. The chunk tags are then
used for chunk detection, in which any consecutive
sequence of words $w_m \ldots w_n$ ($m \le n$) is grouped
into a "flat" chunk X if $w_m$ has been assigned Start
X and $w_{m+1} \ldots w_n$ have all been assigned Join X.
The result of chunk detection, shown in figure 4, is
a forest of trees and serves as the input to the third
pass.
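The chunk detection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `detect_chunks`, the tuple representation of leaves and chunks, and the handling of a Join X that has no matching open chunk (left ungrouped here) are hypothetical choices, not details taken from the paper.

```python
from typing import List, Tuple, Union

# Assumed representation: a leaf is (word, POS tag); a chunk is (label, [leaves]).
Leaf = Tuple[str, str]
Chunk = Tuple[str, List[Leaf]]


def detect_chunks(leaves: List[Leaf], chunk_tags: List[str]) -> List[Union[Leaf, Chunk]]:
    """Group leaves into flat chunks from Start X / Join X / Other tags."""
    forest: List[Union[Leaf, Chunk]] = []
    open_label = None  # label of the chunk currently being extended, if any
    for leaf, tag in zip(leaves, chunk_tags):
        kind, _, label = tag.partition(" ")
        if kind == "Start":
            # Start X opens a new flat chunk labelled X.
            forest.append((label, [leaf]))
            open_label = label
        elif kind == "Join" and open_label == label:
            # Join X extends the chunk X that is currently open.
            forest[-1][1].append(leaf)
        else:
            # Other, or a Join X without a matching open chunk (assumption:
            # such a leaf is left ungrouped).
            forest.append(leaf)
            open_label = None
    return forest


leaves = [("I", "PRP"), ("saw", "VBD"), ("the", "DT"), ("man", "NN"),
          ("with", "IN"), ("the", "DT"), ("telescope", "NN")]
tags = ["Start NP", "Other", "Start NP", "Join NP", "Other", "Start NP", "Join NP"]
forest = detect_chunks(leaves, tags)
```

With the chunk tags of figure 3, the sketch groups "I", "the man", and "the telescope" into NP chunks and leaves "saw" and "with" as single-leaf trees.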

2.3 Third Pass 

The third pass always alternates between the use
of BUILD and CHECK, and completes any remaining
constituent structure. BUILD decides whether
a tree will start a new constituent or join the
incomplete constituent immediately to its left.
Accordingly, it annotates the tree with either Start X,
where X is any constituent label, or with Join X,
where X matches the label of the incomplete
constituent to the left. BUILD always processes the
leftmost tree without any Start X or Join X
annotation. Figure 5 shows an application of BUILD
in which the action is Join VP. After BUILD,
control passes to CHECK, which finds the most recently
proposed constituent, and decides if it is complete.
The most recently proposed constituent, shown in
figure 6, is the rightmost sequence of trees $t_m \ldots t_n$
($m \le n$) such that $t_m$ is annotated with Start X
and $t_{m+1} \ldots t_n$ are annotated with Join X. If CHECK
decides Yes, then the proposed constituent takes its
place in the forest as an actual constituent, on which
BUILD does its work. Otherwise, the constituent is
not finished and BUILD processes the next tree in
the forest, $t_{n+1}$. CHECK always answers No if the
proposed constituent is a "flat" chunk, since such
constituents must be formed in the second pass.
Figure 7 shows the result when CHECK looks at the
proposed constituent in figure 6 and decides No. The
third pass terminates when CHECK is presented a
constituent that spans the entire sentence.

Table 2 compares the actions of BUILD and CHECK
to the operations of a standard shift-reduce parser.
The No and Yes actions of CHECK correspond to the
shift and reduce actions, respectively. The important
difference is that while a shift-reduce parser
creates a constituent in one step (reduce $\alpha$), the
procedures BUILD and CHECK create it over several steps
in smaller increments.

Procedure | Actions         | Similar Shift-Reduce Parser Action
----------|-----------------|----------------------------------------------------------------------
CHECK     | No              | shift
CHECK     | Yes             | reduce $\alpha$, where $\alpha$ is CFG rule of proposed constituent
BUILD     | Start X, Join X | Determines $\alpha$ for subsequent reduce operations

Table 2: Comparison of BUILD and CHECK to operations
of a shift-reduce parser
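To make the notion of the "most recently proposed constituent" concrete, the following Python sketch locates it in an annotated forest. It is a minimal illustration under stated assumptions: trees are reduced to hypothetical `(label, annotation)` pairs, the function name `proposed_constituent` is invented for this example, and the annotations in the sample forest (including Start S on the first tree) are illustrative rather than copied from the figures.

```python
from typing import List, Optional, Tuple

# Assumed representation: a tree is (label, annotation); the annotation is
# None until BUILD assigns "Start X" or "Join X".
Tree = Tuple[str, Optional[str]]


def proposed_constituent(forest: List[Tree]) -> Optional[Tuple[str, int, int]]:
    """Return (X, m, n) for the rightmost run Start X, Join X, ..., Join X."""
    # n is the index of the rightmost annotated tree.
    n = max((i for i, (_, ann) in enumerate(forest) if ann is not None), default=None)
    if n is None:
        return None
    label = forest[n][1].split(" ", 1)[1]
    m = n
    # Walk left over the Join X annotations until the Start X is reached.
    while forest[m][1] == f"Join {label}":
        m -= 1
        if m < 0:
            return None  # malformed forest: Join X without a Start X
    if forest[m][1] != f"Start {label}":
        return None
    return label, m, n


# Forest after BUILD has annotated the third tree with Join VP.
forest = [("NP", "Start S"), ("VBD", "Start VP"), ("NP", "Join VP"),
          ("IN", None), ("NP", None)]
result = proposed_constituent(forest)
```

Here the proposed constituent is a VP spanning the second and third trees (indices 1 and 2), which CHECK would then judge as complete or not.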



I saw the man with the telescope

Figure 1: Initial Sentence

PRP   VBD   DT    NN    IN    DT    NN
 |     |    |     |     |     |     |
 I    saw   the   man   with  the   telescope

Figure 2: The result after First Pass

Start NP  Other  Start NP  Join NP  Other  Start NP  Join NP
   |        |       |         |       |       |         |
  PRP      VBD      DT        NN      IN      DT        NN
   |        |       |         |       |       |         |
   I       saw     the       man     with    the    telescope

Figure 3: The result after Second Pass

[NP [PRP I]] [VBD saw] [NP [DT the] [NN man]] [IN with] [NP [DT the] [NN telescope]]

Figure 4: The result of chunk detection

Figure 5: An application of BUILD in which Join VP is the action

Figure 6: The most recently proposed constituent (shown under ?)

Figure 7: An application of CHECK in which No is the action, indicating that the proposed constituent in
figure 6 is not complete. BUILD will now process the tree marked with ?



3 Probability Model 

This paper takes a "history-based" approach (Black
et al., 1993) where each tree-building procedure uses
a probability model $p(a|b)$, derived from $p(a,b)$, to
weight any action $a$ based on the available context,
or history, $b$. First, we present a few simple categories
of contextual predicates that capture any information
in $b$ that is useful for predicting $a$. Next,
the predicates are used to extract a set of features
from a corpus of manually parsed sentences. Finally,
those features are combined under the maximum
entropy framework, yielding $p(a,b)$.

3.1 Contextual Predicates 

Contextual predicates are functions that check for
the presence or absence of useful information in a
context $b$ and return true or false accordingly. The
comprehensive guidelines, or templates, for the
contextual predicates of each tree building procedure
are given in table 3. The templates use indices
relative to the tree that is currently being modified.
For example, if the current tree is the 5th
tree, cons(-2) looks at the constituent label, head
word, and start/join annotation of the 3rd tree in
the forest. The actual contextual predicates are
generated automatically by scanning the derivations of
the trees in the manually parsed corpus with the
templates. For example, an actual contextual predicate
based on the template cons(0) might be "Does
cons(0) = { NP, he } ?" Constituent head words
are found, when necessary, with the algorithm in
(Magerman, 1995).

Contextual predicates which look at head words,
or especially pairs of head words, may not be reliable
predictors for the procedure actions due to
their sparseness in the training sample. Therefore,
for each lexically based contextual predicate, there
also exist one or more corresponding less specific,
or "backed-off", contextual predicates which look
at the same context, but omit one or more words.
For example, the contexts cons(0, 1*), cons(0*, 1),
cons(0*, 1*) are the same as cons(0, 1) but omit
references to the head word of the 1st tree, the 0th
tree, and both the 0th and 1st tree, respectively.
The backed-off contextual predicates should allow
the model to provide reliable probability estimates
when the words in the history are rare. Backed-off
predicates are not enumerated in table 3, but their
existence is indicated with a * and †.
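The back-off scheme can be illustrated with a short Python sketch that enumerates a lexical predicate together with its backed-off variants. The function name `backed_off_predicates` and the `(label, head word)` context representation are assumptions made for this example; start/join annotations are left out to keep it small.

```python
from itertools import product
from typing import Dict, List, Tuple


def backed_off_predicates(name: str, indices: List[int],
                          context: Dict[int, Tuple[str, str]]) -> List[Tuple[str, tuple]]:
    """Enumerate a lexical predicate and every variant that omits head words.

    context maps a relative tree index to an assumed (label, head word) pair.
    An index written with a trailing * has its head word omitted.
    """
    predicates = []
    # For each index, either keep the head word (True) or omit it (False).
    for keep in product([True, False], repeat=len(indices)):
        args = [str(i) if k else f"{i}*" for i, k in zip(indices, keep)]
        values = tuple(context[i] if k else context[i][:1]
                       for i, k in zip(indices, keep))
        predicates.append((f"{name}({', '.join(args)})", values))
    return predicates


context = {0: ("VBD", "saw"), 1: ("NP", "man")}
preds = backed_off_predicates("cons", [0, 1], context)
names = [name for name, _ in preds]
```

For cons(0, 1) this yields the fully lexicalized predicate followed by cons(0, 1*), cons(0*, 1), and cons(0*, 1*), matching the three backed-off contexts named in the text.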

3.2 Maximum Entropy Framework 

The contextual predicates derived from the templates
of table 3 are used to create the features necessary
for the maximum entropy models. The predicates
for TAG, CHUNK, BUILD, and CHECK are used
to scan the derivations of the trees in the corpus to
form the training samples $\mathcal{T}_{tag}$, $\mathcal{T}_{chunk}$,
$\mathcal{T}_{build}$, and $\mathcal{T}_{check}$, respectively. Each
training sample has the form
$\mathcal{T} = \{(a_1, b_1), (a_2, b_2), \ldots, (a_N, b_N)\}$, where $a_i$
is an action of the corresponding procedure and $b_i$
is the list of contextual predicates that were true in
the context in which $a_i$ was decided.

The training samples are respectively used to create
the models $p_{tag}$, $p_{chunk}$, $p_{build}$, and $p_{check}$, all of
which have the form:

$$p(a,b) = \pi \prod_{j=1}^{k} \alpha_j^{f_j(a,b)} \qquad (1)$$

where $a$ is some action, $b$ is some context, $\pi$ is a
normalization constant, $\alpha_j$ are the model parameters,
$0 < \alpha_j < \infty$, and $f_j(a,b) \in \{0, 1\}$ are called features,
$j = \{1 \ldots k\}$. Features encode an action $a'$ as well
as some contextual predicate $cp$ that a tree-building
procedure would find useful for predicting the action
$a'$. Any contextual predicate $cp$ derived from table 3
which occurs 5 or more times in a training sample
with a particular action $a'$ is used to construct a
feature $f_j$:

$$f_j(a,b) = \begin{cases} 1 & \text{if } cp(b) = \text{true} \ \&\&\ a = a' \\ 0 & \text{otherwise} \end{cases}$$

for use in the corresponding model. Each feature $f_j$
corresponds to a parameter $\alpha_j$, which can be viewed
as a "weight" that reflects the importance of the
feature.
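As a small worked illustration of form (1), the Python sketch below builds binary features from (contextual predicate, action) pairs and evaluates the unnormalized product. The names `make_feature` and `joint`, the string encoding of predicates, and the parameter values are all hypothetical; the context $b$ is represented as the set of predicates that are true in it.

```python
import math
from typing import Callable, Sequence, Set

Feature = Callable[[str, Set[str]], int]


def make_feature(cp: str, action: str) -> Feature:
    """f(a, b) = 1 if predicate cp is true in b and a is the given action."""
    return lambda a, b: 1 if (cp in b and a == action) else 0


def joint(a: str, b: Set[str], features: Sequence[Feature],
          alphas: Sequence[float], pi: float = 1.0) -> float:
    """p(a, b) = pi * prod_j alpha_j ** f_j(a, b), as in form (1)."""
    return pi * math.prod(alpha ** f(a, b) for f, alpha in zip(features, alphas))


# Two illustrative features with made-up weights.
features = [make_feature("cons(0)=NP", "Join VP"),
            make_feature("cons(-1)=VBD", "Join VP")]
alphas = [2.0, 3.0]
b = {"cons(0)=NP", "cons(-1)=VBD"}

p_join = joint("Join VP", b, features, alphas)    # both features fire: 2.0 * 3.0
p_start = joint("Start NP", b, features, alphas)  # no feature fires: empty product
```

Only the features whose predicate holds in $b$ and whose action matches $a$ contribute their weight, so an action with no active features receives just the constant $\pi$.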

The parameters $\{\alpha_1 \ldots \alpha_k\}$ are found automatically
with Generalized Iterative Scaling (Darroch
and Ratcliff, 1972), or GIS. The GIS procedure, as
well as the maximum entropy and maximum likelihood
properties of the distribution of form (1), are
described in detail in (Ratnaparkhi, 1997). In general,
the maximum entropy framework puts no limitations
on the kinds of features in the model; no
special estimation technique is required to combine
features that encode different kinds of contextual
predicates, like punctuation and cons(0, 1, 2). As a
result, experimenters need only worry about what
features to use, and not how to use them.

Model | Categories          | Description | Templates Used
------|---------------------|-------------|---------------
TAG   | See (Ratnaparkhi, 1996) | |
CHUNK | chunkandpostag(n)*  | The word, POS tag, and chunk tag of nth leaf. Chunk tag omitted if $n \ge 0$. | chunkandpostag(0), chunkandpostag(-1), chunkandpostag(-2), chunkandpostag(1), chunkandpostag(2)
CHUNK | chunkandpostag(m, n)* | chunkandpostag(m) & chunkandpostag(n) | chunkandpostag(-1, 0), chunkandpostag(0, 1)
BUILD | cons(n)             | The head word, constituent (or POS) label, and start/join annotation of the nth tree. Start/join annotation omitted if $n \ge 0$. | cons(0), cons(-1), cons(-2), cons(1), cons(2)
BUILD | cons(m, n)*         | cons(m) & cons(n) | cons(-1, 0), cons(0, 1)
BUILD | cons(m, n, p)†      | cons(m), cons(n), & cons(p). | cons(0, -1, -2), cons(0, 1, 2), cons(-1, 0, 1)
BUILD | punctuation         | The constituent we could join (1) contains a "[" and the current tree is a "]"; (2) contains a "," and the current tree is a ","; (3) spans the entire sentence and current tree is "." | bracketsmatch, iscomma, endofsentence
CHECK | checkcons(n)*       | The head word, constituent (or POS) label of the nth tree, and the label of proposed constituent. begin and last are first and last child (resp.) of proposed constituent. | checkcons(last), checkcons(begin)
CHECK | checkcons(m, n)*    | checkcons(m) & checkcons(n) | checkcons(i, last), begin < i < last
CHECK | production          | Constituent label of parent ($X$), and constituent or POS labels of children ($X_1 \ldots X_n$) of proposed constituent | production = $X \rightarrow X_1 \ldots X_n$
CHECK | surround(n)*        | POS tag and word of the nth leaf to the left of the constituent, if $n < 0$, or to the right of the constituent, if $n > 0$ | surround(1), surround(2), surround(-1), surround(-2)

Table 3: Contextual Information Used by Probability Models (* = all backed-off contexts are used, † = only
backed-off contexts that include head word of current tree, i.e., 0th tree, are used)

We then use the models $p_{tag}$, $p_{chunk}$, $p_{build}$, and
$p_{check}$ to define a function score, which the search
procedure uses to rank derivations of incomplete and
complete parse trees. For each model, the corresponding
conditional probability is defined as usual:

$$p(a|b) = \frac{p(a,b)}{\sum_{a'} p(a',b)}$$

For notational convenience, define $q$ as follows:

$$q(a|b) = \begin{cases}
p_{tag}(a|b) & \text{if } a \text{ is an action from TAG} \\
p_{chunk}(a|b) & \text{if } a \text{ is an action from CHUNK} \\
p_{build}(a|b) & \text{if } a \text{ is an action from BUILD} \\
p_{check}(a|b) & \text{if } a \text{ is an action from CHECK}
\end{cases}$$

Let deriv($T$) $= \{a_1, \ldots, a_n\}$ be the derivation of a
parse $T$, where $T$ is not necessarily complete, and
where each $a_i$ is an action of some tree-building
procedure. By design, the tree-building procedures
guarantee that $\{a_1, \ldots, a_n\}$ is the only derivation for
the parse $T$. Then the score of $T$ is merely the product
of the conditional probabilities of the individual
actions in its derivation:

$$\mathrm{score}(T) = \prod_{a_i \in \mathrm{deriv}(T)} q(a_i|b_i)$$

where $b_i$ is the context in which $a_i$ was decided.
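The normalization and the score can be sketched together in Python. This is an illustration only: the joint probabilities come from a made-up lookup table rather than a trained model, and the names `conditional`, `score`, `JOINT`, and `ACTIONS` are hypothetical.

```python
import math
from typing import Callable, Hashable, Iterable, Sequence, Tuple

Joint = Callable[[str, Hashable], float]


def conditional(joint: Joint, a: str, b: Hashable, actions: Iterable[str]) -> float:
    """p(a | b) = p(a, b) / sum over a' of p(a', b)."""
    total = sum(joint(x, b) for x in actions)
    return joint(a, b) / total


def score(derivation: Sequence[Tuple[str, Hashable]],
          q: Callable[[str, Hashable], float]) -> float:
    """Product of q(a_i | b_i) over the (action, context) pairs of a derivation."""
    return math.prod(q(a, b) for a, b in derivation)


# Toy joint model over two actions in a single context (made-up numbers).
ACTIONS = ["Start NP", "Other"]
JOINT = {("Start NP", "ctx"): 0.3, ("Other", "ctx"): 0.1}


def toy_joint(a, b):
    return JOINT[(a, b)]


def toy_q(a, b):
    return conditional(toy_joint, a, b, ACTIONS)


p = toy_q("Start NP", "ctx")
s = score([("Start NP", "ctx"), ("Other", "ctx")], toy_q)
```

For long derivations a practical implementation would more likely sum log probabilities to avoid underflow; the plain product is kept here because it mirrors the definition directly.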

4 Search

The search heuristic attempts to find the best parse
$T^*$, defined as:

$$T^* = \arg\max_{T \in \mathrm{trees}(S)} \mathrm{score}(T)$$

where trees($S$) are all the complete parses for an
input sentence $S$.

The heuristic employs a breadth-first search
(BFS) which does not explore the entire frontier,
but rather, explores only at most the top $K$ scoring
incomplete parses in the frontier, and terminates
when it has found $M$ complete parses, or when all
the hypotheses have been exhausted. Furthermore,
if $\{a_1 \ldots a_n\}$ are the possible actions for a given
procedure on a derivation with context $b$, and they
are sorted in decreasing order according to $q(a_i|b)$,
we only consider exploring those actions $\{a_1 \ldots a_m\}$
that hold most of the probability mass, where $m$ is
defined as follows:

$$m = \max_m \sum_{i=1}^{m} q(a_i|b) < Q$$

and where $Q$ is a threshold less than 1. The search
also uses a Tag Dictionary constructed from training
data, described in (Ratnaparkhi, 1996), that reduces
the number of actions explored by the tagging
model. Thus there are three parameters for the
search heuristic, namely $K$, $M$, and $Q$, and all
experiments reported in this paper use $K = 20$, $M = 20$,
and $Q = .95$.¹ Table 4 describes the top K BFS and
the semantics of the supporting functions.
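The probability-mass cutoff can be sketched as follows. This is one possible reading of the definition of $m$, not the author's code: actions are kept while the accumulated mass stays below $Q$, and the `min_keep` parameter, which keeps the top action even when it alone reaches $Q$, is an added assumption so that a derivation can always be advanced. The function name `prune_actions` is hypothetical.

```python
from typing import Dict, List


def prune_actions(probs: Dict[str, float], Q: float = 0.95,
                  min_keep: int = 1) -> List[str]:
    """Keep the highest-probability actions whose cumulative mass stays below Q.

    min_keep is an assumption of this sketch: at least that many actions are
    retained even if the first one already reaches the threshold.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept: List[str] = []
    mass = 0.0
    for action, p in ranked:
        if mass + p >= Q and len(kept) >= min_keep:
            break
        kept.append(action)
        mass += p
    return kept


kept = prune_actions({"NN": 0.6, "VB": 0.3, "JJ": 0.07, "RB": 0.03})
single = prune_actions({"DT": 0.99, "NN": 0.01})
```

In the first call the two most probable actions are explored and the tail is dropped; in the second, the dominant action is kept alone.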

It should be emphasized that if $K > 1$, the parser
does not commit to a single POS or chunk assignment
for the input sentence before building constituent
structure. All three of the passes described
in section 2 are integrated in the search, i.e., when
parsing a test sentence, the input to the second pass
consists of $K$ of the best distinct POS tag assignments
for the input sentence. Likewise, the input
to the third pass consists of $K$ of the best distinct
chunk and POS tag assignments for the input sentence.

The top K BFS described above exploits the observed
property that the individual steps of correct
derivations tend to have high probabilities, and thus
avoids searching a large fraction of the search space.
Since, in practice, it only does a constant amount of
work to advance each step in a derivation, and since
derivation lengths are roughly proportional to the
sentence length, we would expect it to run in linear
observed time with respect to sentence length.
Figure 8 confirms our assumptions about the linear
observed running time.
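The control flow of the top K BFS can be sketched in Python with one heap per derivation length. This is a simplified illustration under stated assumptions: a derivation is a hypothetical `(score, actions)` pair, the `advance` and `completed` callables are supplied by the caller (any pruning by $Q$ would happen inside `advance`), and the toy two-step problem at the bottom stands in for the real tree-building procedures.

```python
import heapq
import itertools
from typing import Callable, Dict, List, Tuple

Derivation = Tuple[float, Tuple[str, ...]]  # (score, actions taken so far)


def top_k_bfs(advance: Callable[[Derivation], List[Derivation]],
              completed: Callable[[Derivation], bool],
              K: int = 20, M: int = 20) -> List[Derivation]:
    """Advance at most K derivations from the deepest non-empty heap per round."""
    counter = itertools.count()  # tie-breaker so heapq never compares derivations
    # heaps[i] holds derivations of length i, ordered by descending score.
    heaps: Dict[int, list] = {0: [(-1.0, next(counter), (1.0, ()))]}
    done: List[Derivation] = []
    while len(done) < M:
        nonempty = [i for i, h in heaps.items() if h]
        if not nonempty:
            break
        i = max(nonempty)
        for _ in range(min(K, len(heaps[i]))):
            _, _, d = heapq.heappop(heaps[i])
            for new in advance(d):
                if completed(new):
                    done.append(new)
                else:
                    heapq.heappush(heaps.setdefault(i + 1, []),
                                   (-new[0], next(counter), new))
    return sorted(done, key=lambda d: d[0], reverse=True)


# Toy problem: two steps, each with two possible actions (made-up numbers).
STEPS = [{"a": 0.7, "b": 0.3}, {"c": 0.6, "d": 0.4}]


def toy_advance(d: Derivation) -> List[Derivation]:
    s, actions = d
    return [(s * p, actions + (a,)) for a, p in STEPS[len(actions)].items()]


def toy_completed(d: Derivation) -> bool:
    return len(d[1]) == len(STEPS)


parses = top_k_bfs(toy_advance, toy_completed)
narrow = top_k_bfs(toy_advance, toy_completed, K=1, M=2)
```

With a beam of one (`K=1`), only the extensions of the best first action are completed before the stopping condition on `M` is met, which shows how a narrow beam trades coverage for speed.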

5 Experiments 

The maximum entropy parser was trained on sections
2 through 21 (roughly 40000 sentences) of
the Penn Treebank Wall St. Journal corpus, release
2 (Marcus, Santorini, and Marcinkiewicz, 1994), and
tested on section 23 (2416 sentences) for comparison
with other work. All trees were stripped of
their semantic tags (e.g., -LOC, -BNF, etc.), coreference
information (e.g., *-1), and quotation marks
(`` and '') for both training and testing. The PARSEVAL
(Black et al., 1991) measures compare
a proposed parse $P$ with the corresponding correct
treebank parse $T$ as follows:



$$\text{Recall} = \frac{\#\ \text{correct constituents in } P}{\#\ \text{constituents in } T}$$

$$\text{Precision} = \frac{\#\ \text{correct constituents in } P}{\#\ \text{constituents in } P}$$



A constituent in $P$ is "correct" if there exists a
constituent in $T$ of the same label that spans the same
words. Table 5 shows results using the PARSEVAL
measures, as well as results using the slightly more
forgiving measures of (Collins, 1996) and (Magerman,
1995). Table 5 shows that the maximum entropy
parser performs better than the parsers presented
in (Collins, 1996) and (Magerman, 1995)²,
which have the best previously published parsing
accuracies on the Wall St. Journal domain.

¹ The parameters K, M, and Q were optimized on
"held out" data separate from the training and test sets.

² Results for SPATTER on section 23 are reported in
(Collins, 1996).

    advance:   d × Q → d_1 ... d_n   /* Applies relevant tree building procedure to d
                                        and returns list of new derivations whose action
                                        probabilities pass the threshold Q */
    insert:    d × h → void          /* inserts d in heap h */
    extract:   h → d                 /* removes and returns derivation in h
                                        with highest score */
    completed: d → {true, false}     /* returns true if and only if
                                        d is a complete derivation */

    M = 20
    K = 20
    Q = .95
    C = <empty heap>                 /* Heap of completed parses */
    h_0 = <input sentence>           /* h_i contains derivations of length i */
    while ( |C| < M )
        if ( for all i, h_i is empty )
            then break
        i = max{i | h_i is non-empty}
        sz = min(K, |h_i|)
        for j = 1 to sz
            d_1 ... d_p = advance( extract(h_i), Q )
            for q = 1 to p
                if ( completed(d_q) )
                    then insert(d_q, C)
                    else insert(d_q, h_{i+1})

Table 4: Top K BFS Search Heuristic

[Figure 8 plot: running time in seconds (vertical axis) against sentence length (horizontal axis).]

Figure 8: Observed running time of top K BFS on Section 23 of Penn Treebank WSJ, using one 167MHz
UltraSPARC processor and 256MB RAM of a Sun Ultra Enterprise 4000.

Parser            | Precision | Recall
------------------|-----------|-------
Maximum Entropy°  | 86.8%     | 85.6%
Maximum Entropy*  | 87.5%     | 86.3%
(Collins, 1996)*  | 85.7%     | 85.3%
(Magerman, 1995)* | 84.3%     | 84.0%

Table 5: Results on 2416 sentences of section 23
(0 to 100 words in length) of the WSJ Treebank.
Evaluations marked with ° ignore quotation marks.
Evaluations marked with * collapse the distinction
between ADVP and PRT, and ignore all punctuation.
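The PARSEVAL precision and recall defined in this section can be computed with a few lines of Python. The sketch below assumes constituents are given as hypothetical `(label, start, end)` triples over word positions and counts matches as a multiset intersection; both the representation and the function name `parseval` are choices made for this illustration, and the example brackets are invented rather than taken from the treebank.

```python
from collections import Counter
from typing import Iterable, Tuple

Constituent = Tuple[str, int, int]  # assumed (label, start, end) over word positions


def parseval(proposed: Iterable[Constituent],
             gold: Iterable[Constituent]) -> Tuple[float, float]:
    """Return (precision, recall) of the proposed parse P against the gold parse T."""
    p, t = Counter(proposed), Counter(gold)
    # A constituent is correct if the gold parse has one with the same label and span.
    correct = sum((p & t).values())
    precision = correct / sum(p.values()) if p else 0.0
    recall = correct / sum(t.values()) if t else 0.0
    return precision, recall


# Illustrative brackets for "I saw the man with the telescope".
gold = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7), ("NP", 2, 7),
        ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)]
proposed = [("S", 0, 7), ("NP", 0, 1), ("VP", 1, 7),
            ("NP", 2, 4), ("PP", 4, 7), ("NP", 5, 7)]
precision, recall = parseval(proposed, gold)
```

In this example every proposed constituent appears in the gold parse, but one gold constituent is missed, so precision is perfect while recall is six out of seven.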

It is often advantageous to produce the top $N$
parses instead of just the top 1, since additional
information can be used in a secondary model that
reorders the top $N$ and hopefully improves the quality
of the top ranked parse. Suppose there exists a "perfect"
reranking scheme that, for each sentence, magically
picks the best parse from the top $N$ parses produced
by the maximum entropy parser, where the
best parse has the highest average precision and recall
when compared to the treebank parse. The performance
of this "perfect" scheme is then an upper
bound on the performance of any reranking scheme
that might be used to reorder the top $N$ parses.
Figure 9 shows that the "perfect" scheme would achieve
roughly 93% precision and recall, which is a dramatic
increase over the top 1 accuracy of 87% precision
and 86% recall. Figure 10 shows that the "Exact
Match", which counts the percentage of times
the proposed parse $P$ is identical (excluding POS
tags) to the treebank parse $T$, rises substantially
to about 53% from 30% when the "perfect" scheme
is applied. For this reason, research into reranking
schemes appears to be a promising step towards the
goal of improving parsing accuracy.
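The "perfect" reranking scheme amounts to an oracle that consults the treebank parse. A minimal Python sketch, assuming each parse is reduced to a set of hypothetical `(label, start, end)` constituents and that ties are broken in favour of the higher-ranked parse (an assumption of this example), might look like this:

```python
from typing import List, Set, Tuple

Constituent = Tuple[str, int, int]  # assumed (label, start, end)


def avg_precision_recall(proposed: Set[Constituent], gold: Set[Constituent]) -> float:
    """Average of precision and recall of one proposed parse against the gold parse."""
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return (precision + recall) / 2


def oracle_pick(top_n: List[Set[Constituent]], gold: Set[Constituent]) -> int:
    """Index of the parse in the top N list with the highest average score.

    max() returns the first maximum, so ties go to the higher-ranked parse.
    """
    return max(range(len(top_n)), key=lambda i: avg_precision_recall(top_n[i], gold))


gold = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}
top_n = [
    {("S", 0, 3), ("NP", 0, 2)},               # top-ranked parse, partly wrong
    {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}  # second-ranked parse, fully correct
]
best = oracle_pick(top_n, gold)
```

Because the oracle needs the gold parse, it only serves to measure the upper bound discussed above; a real reranker would have to approximate this choice from other information.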

6 Comparison With Previous Work 

The two parsers which have previously reported the 
best accuracies on the Penn Treebank Wall St. Jour- 
nal are the bigram parser described in (Collins, 1996) 
and the SPATTER parser described in (Jelinek et 
al., 1994; Magerman, 1995). The parser presented 
here outperforms both the bigram parser and the 
SPATTER parser, and uses different modelling tech- 
nology and different information to drive its deci- 
sions. 

The bigram parser is a statistical CKY-style chart
parser, which uses cooccurrence statistics of
head-modifier pairs to find the best parse. The maximum
entropy parser is a statistical shift-reduce
style parser that cannot always access head-modifier
pairs. For example, the checkcons(m, n) predicate
of the maximum entropy parser may use two words
such that neither is the intended head of the proposed
constituent that the CHECK procedure must
judge. And unlike the bigram parser, the maximum
entropy parser cannot use head word information
besides "flat" chunks in the right context.

The bigram parser uses a backed-off estimation 
scheme that is customized for a particular task, 
whereas the maximum entropy parser uses a gen- 
eral purpose modelling technique. This allows the 
maximum entropy parser to easily integrate vary- 
ing kinds of features, such as those for punctua- 
tion, whereas the bigram parser uses hand-crafted 
punctuation rules. Furthermore, the customized es- 
timation framework of the bigram parser must use 
information that has been carefully selected for its 
value, whereas the maximum entropy framework ro- 
bustly integrates any kind of information, obviating 
the need to screen it first. 

The SPATTER parser is a history-based parser 
that uses decision tree models to guide the opera- 
tions of a few tree building procedures. It differs 
from the maximum entropy parser in how it builds 
trees and more critically, in how its decision trees 
use information. The SPATTER decision trees use 
predicates on word classes created with a statistical 
clustering technique, whereas the maximum entropy 
parser uses predicates that contain merely the words 
themselves, and thus lacks the need for a (typically 
expensive) word clustering procedure. Furthermore, 
the top K BFS search heuristic appears to be much 
simpler than the stack decoder algorithm outlined 
in (Magerman, 1995). 

7 Conclusion 

The maximum entropy parser presented here
achieves a parsing accuracy which exceeds the best
previously published results, and parses a test sentence
in linear observed time, with respect to the
sentence length. It uses simple and concisely specified
predicates which can be added or modified quickly
with little human effort under the maximum entropy
framework. Lastly, this paper clearly demonstrates
that schemes for reranking the top 20 parses deserve
research effort since they could yield vastly better
accuracy results.